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Abstract 

In this paper, we consider multivariate response regression models with high di¬ 
mensional predictor variables. One way to model the correlation among the response 
variables is through the low rank decomposition of the coefficient matrix, which has 
been considered by several papers for the high dimensional predictors. However, all 
these papers focus on the singular value decomposition of the coefficient matrix. Our 
target is the decomposition of the coefficient matrix which leads to the best lower rank 
approximation to the regression function, the signal part in the response. Given any 
rank, this decomposition has nearly the smallest expected prediction error among all 
approximations to the the coefficient matrix with the same rank. To estimate the de¬ 
composition, we formulate a penalized generalized eigenvalue problem to obtain the 
first matrix in the decomposition and then obtain the second one by a least squares 
method. In the high-dimensional setting, we establish the oracle inequalities for the 
estimates. Compared to the existing theoretical results, we have less restrictions on 
the distribution of the noise vector in each observation and allow correlations among 
its coordinates. Our theoretical results do not depend on the dimension of the mul¬ 
tivariate response. Therefore, the dimension is arbitrary and can be larger than the 
sample size and the dimension of the predictor. Simulation studies and application 
to real data show that the proposed method has good prediction performance and is 
efficient in dimension reduction for various reduced rank models. 


Keyword: multivariate regression; high dimensional predictors; signal extraction; dimen¬ 
sion reduction; best lower rank approximation; oracle inequalities. 



1 Introduction 


In this paper, we consider multivariate response regression models with high dimensional 
predictor variables. Several methods have been proposed to estimate the coefficient ma¬ 
trix and select a common subset of explanatory variables. The sparse partial least square 


method (Chun and Keles, 2010) finds sparse linear combinations of the original predictors to 


maximize their covariance with response variables. Turlach et al. (2005), Simila and Tikka 


(2007), Peng et al. (2010), Chen and Huang (2012), Chen et al. ( 2012[ ), and Bunea et al. 


(2012) estimate the coefficient matrix by minimizing the penalized (joint) residual sum of 


squares with different penalties. Simila and Tikka (2007) and Turlach et al. (2005) assume 
row sparsity of the coefficient matrix and use group-Lasso type penalties with l 2 or l^ norm 


that treat each row of the regression coefficient matrix as a group. Peng et al. (2010) im¬ 
poses both row-wise and element-wise sparsity on the coefficient matrix. In addition to the 


row-wise sparsity assumption, Chen and Huang (2012), Chen et al. (2012) and Bunea et al. 


(2012) make the reduced rank assumption on the regression coefficient matrix (Izenman 


1975 Reinsel and Velu 1998). Under this assumption, Chen and Huang (2012) and Chen 


et al. (2012) aim to estimate the singular value decomposition (SVD) of the coefficient ma¬ 


trix. Chen and Huang (2012) specify the rank by the cross-validation method, and use a 
group-Lasso type penalty on the first matrix in the SVD of the coefficient matrix to achieve 


sparsity. Chen et al. (2012) pre-specifies the rank by existing methods as given in Anderson 


(2002), Camba-Mendez et al. (2003) or Bunea et al. (2011), and impose an adaptive-lasso 


type penalty. Bunea et al. (2012) penalizes on the rank sparsity and the variable sparsity 
simultaneously, and provides theoretical results in the high-dimensional settings. 

We also assume the reduced-rank structure and the row-wise sparsity on the coefficient 
matrix. Instead of the SVD of the coefficient matrix, our target is the decomposition of 
the coefficient matrix which leads to the best lower rank approximation to the regression 
function, the product of the design matrix and the coefficient matrix. Given any rank, this 
decomposition has the smallest approximation error to the regression function and nearly 
the smallest expected prediction error among all approximations to the coefficient matrix 
with the same rank. Therefore, our proposed method is expected to have good prediction 
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performance and be efficient in dimension reduction. To estimate this decomposition, we 
first propose a penalized generalized eigenvalue problem to obtain the first lower rank matrix 
in this decomposition, and then obtain the second by a least squares method. In the high¬ 
dimensional setting, we establish the oracle inequalities for the estimators of the lower rank 


matrices, the coefficient matrix and the estimated regression function, respectively. Bunea 


et al. (2012) provides the convergence rate of the estimate of the regression function under 


the assumption that the coordinates of the noise vector are identically independently dis¬ 
tributed normal randoms variables. The convergence rate depends on the dimension of the 
multivariate response variable. In order that the convergence rate goes to zero, the increase 
of the dimension of the multivariate response variable has to be slower than the sample size. 
We make weaker assumption on the distribution of the noise vector and allow correlations 
among its coordinates. Our theoretical results do not depend on the dimension of the multi¬ 
variate response. Therefore, the dimension of the multivariate response can be arbitrary and 
even larger than the sample size and the dimension of the predictor. Through simulation 
studies on the reduced rank models with various settings, we demonstrate that our method 
has competitive predictive ability and is efficient in dimension reduction. 

The paper is organized as follows. In Section 2, we formulate the best lower rank ap¬ 
proximation problem and establish its equivalence with a generalized eigenvalue problem. 
In Section 3, in the high dimensional settings, we propose sparse estimates of the decompo¬ 
sition and the coefficient matrix. We establish the oracle inequalities for the estimates. In 
Section 4, we discuss the choice of the number of components and tuning parameters. We 
conduct simulation studies and a case study in Sections 5 and 6, respectively, and summarize 
the paper with discussion in Section 7. All the proofs can be found in the supplementary 
material. 
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2 Signal extraction approach to multivariate regres¬ 


sion (SiER) 


We consider the following linear regression model with responses taking values in R 9 , where 
q > 1. Suppose that the i- th observation satisfies y \ = x i i/3 1 + Xi2/3 2 + • • • + x ip (3 p + 

1 < i < n. Here Xij e R is the i-th observed value of the j-th predictor variable, 
y, = (yn, Vi 2 , ■ ■ ■ , y iq ) T e M 9 and Si = (£a, £ i2l • • • , £ iq ) r E M 9 are the z-th observed response 
vector and the z-th noise vector, respectively, and each coefficient /3 ] = (f3ji,/3j 2 , ■ ■ ■ ,/3j q ) T , 
j = 1,... ,p, is a g-dimensional vector. Let x,- = (xn,Xi 2 , ■ ■ ■ ,x* p ) T denote the z-th obser¬ 
vation of the p-dimensional predictor vector, Y = [y 1; • • • ,y n ] T the n x q response matrix, 
X = [x 1; • • • , x n ] T the nxp design matrix, B = [/3 1; • • ■ , (3 p ] T the px q coefficient matrix, and 
£ = [ei, ■ • • , s n ] T the n x q random noise matrix, respectively. We assume that £i, • • • , £ n , 
are i.i.d. random vectors with E[ei\ = 0. Correlations are allowed among coordinates of e lp 
1 < i < n. Throughout this paper, we will assume that X is nonrandom and the column 


means of X are all zeros, the same setting as in Bickel et al. (2009). Then the model can be 
written as 


Y = XB + e . 


( 2 . 1 ) 


In this paper, we make the reduced-rank assumption, that is, the rank of B, denoted by K, 
is small compared to n, p and q. As the dimension of the subspace spanned by {/3 1 , • • • , f3 p } 
in M 9 is equal to K, given any K linearly independent vectors wy, • ■ ■ , w k in this subspace, 
each coefficient vector /3 - can be expressed as a linear combination of wy, • • • , w k- So we 
have the decomposition 

B = AW t = aywj 1 + • • • + ct K WK, (2.2) 


where A = [ay, • • • , olk\ and W = [wj, • • • , w^] are p x K and q x K matrices, respectively. 


There are inhnitely many choices of wy, • • • , w^, hence the decomposition (2.2) is not unique. 


Chen and Huang (2012) and Chen et ail (2012) consider the SVD, B = UDV T , where U 


and V are p x K and q x K matrices with orthonormal columns, respectively, and D is a 
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K x K nonnegative diagonal matrix. This SVD is a special case of (|2.2[) with 


A = UD and W = V. 


(2.3) 


We will consider a different decomposition which leads to the best lower rank approxi¬ 
mation to X£>, the signal in Y. Specifically, we want to find A and W such that for any 
1 < k < K, we have 


|X£> — X x « jW 

3 =1 


J\\f = nrin IIX# 

J rjeK n ,v^eR 9 , 

1 <j<k 


J2 ' i /•• 

i=i 


(2.4) 


where 


is the Frobenius norm and is define as II m IIf = \/E;=i E,=i M l for any s x t 


matrix M. Therefore, (2.4) implies that E = 1 -^- Q b w j * s the best ran k k approximation to 


XH for any 1 < k < K. Note that when we change ctj to otj/c and w j to cWj, where c is any 
nonzero scalar, oijwj is unchanged. Hence, we restrict that ctj S ctj = 1 for any 1 < j < K, 
where S = X T X/n. To find A and W, we consider the SVD of X£>, 


XB = (7,7, uf + <t 2 72 u 2 4 -- <7 R -7 A .uJ-, 


(2.5) 


where a\ > 02 > • • • > ok > 0 are singular values of X£>, j k e M n and u^, e R q are the 
left-singular and right-singular vectors corresponding to cr*,, respectively, with ||u*.|| 2 = 1, 
Il'YfcII 2 = 1- By the Eckart-Young Theorem, Ej=i <T j^Yj vt J i s the best rank k approximation 
to X£>. We dehne the columns of W and A as 


w k = -yp=U k , Oik = 1 < k < K, 

at 


n 


( 2 . 6 ) 


respectively. As U; : , 1 < k < K , are orthonormal, by (2.5) and (2.6), 


Xct k = ^Xt3w k = —XBu k = ^nj kl 1 <k< K. 


at 




(2.7) 


Therefore, J2j=i Xa i w J = V- is the best rank k approximation to XH and 

ot]:Sot k = 1, for any 1 < k < K. 

We visually illustrate the difference in approximating XH using the two decompositions 


of the coefficient matrix: SVD of B given by (2.3) and our decomposition given by (2.6) 


based on the SVD of X£>, in Figure [l] through an example. We take n = 100, p = 1000 and 
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Figure 1: Relative squared low rank approximation error to X£> based on our decomposition 
(red) and the SVD (blue) of B. The x-axis is the rank of the approximation matrix. 


q = 100. Each row of X is generated from a p-dimensional multivariate normal distribution 
with mean zero. Its covariance matrix has all the diagonal elements equal to 1 and all the 
off-diagonal elements equal to 0.7. We generate B using B = CD, where C is p x 25 and D 
is 25 x q. Elements in the first 40 rows of C are independently generated from 1V(0,1), and 
the other p — 40 rows of C are zeros. Elements in D are independently generated from the 
uniform distribution between —1 and 1. Therefore, in this example, the rank of B is K = 25. 
We perform the simulation 100 times. In each repeat, for each 1 < k < K, we calculate the 
relative squared approximation error to X£> defined by ||X£> — X£>fc||p/||X£>||fi, where Bk is 
the rank k approximation matrix to B using our decomposition (red in Figure [I]) or the SVD 
(blue). In Figure [lj we plot the relative error for 10 repeats in the left panel, and plot the 
mean relative error over 100 repeats in the right panel, when k changes from 1 to K = 25. 

In the following, for any 1 < k < K, let Bk = aiw[ + • • • + the sum of the 


first k terms of our decomposition. We will show in Section 3.2 that the property that 
X£>fc is the best rank k approximation to X£? leads to the property that Bk has nearly the 
smallest expected prediction errors among all possible rank k approximation to B under the 
row-wise sparsity assumption on B. Our decomposition of B leads to the following model 
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transformation, 


Y = X£> + e = TW 1 + e = tiwy + • • • + + e, 


( 2 . 8 ) 


where 


T = [ti, • • • , t K \ , and t j = Xaq = i/rry ■, 1 < j < K , 


(2.9) 


are new orthogonal predictors. 

To estimate the decomposition, we first estimate op,-- - , a# based on the following 


theorem, then estimate t 1; • • • , tx by (2.9). Finally, based on model (2.8) and the least 


squares method, we obtain the estimates of wq, • • • , w k- Define 

Z = ^X, S = -(X£)(X£) t = ZBB T Z T , S = -X T X = Z T Z, 
n n n 


b = 4x t CXB) (XB) t X = SBB t S = Z t HZ. 
n z 


( 2 . 10 ) 


Theorem 2.1. Suppose that the n x n nonnegative definite matrix H has K positive eigen¬ 
values /ii(S) > /x 2 (S) > • • • > plr{ S) > 0. Then 


(a). K < min {n,p,q}. For any 1 < k < K, ct k in (2.6) is the solution to 


max ck t Bq(, subject to a^Scx = 1, a^Sa = 0, 1 < l < k — 1. (2.11) 

aeMP ’ 1 V ' 


Moreover, the maximum value of ( 2.11) is equal to ctfHct k = // fc (3). 


(b). In the SVD (2.5) ofXB, the singular values are o k = 1 < k < K. The 

left-singular vectors satisfy 


7i = Za 1; 7 2 = Za 2 , i k = Zol k , 


( 2 . 12 ) 


and they are the K eigenvectors of 3 corresponding to the K positive eigenvalues with 

II7J2 = 1 . 

(c). The approximation error of the best rank k approximation to XB is 


IXB-X* 


K 


T || 2 

i W i II F = 


\XB-X(J2 ( *^7)\\ 2 F = \\ XI3 -XB k \\ 2 F = n Y, At*(S) 


i =1 


i— 1 




for any 1 < k < K. 
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By Theorem [2jJb), p k ( H) = cr 1/n = ||crfc 7 fc u^||p/n. So /ifc(E) can be viewed as a 
measure of the magnitude of the signal in the fc-th component of the SVD of X£>. Therefore, 
our choice of W and A makes the signal concentrated in the first few components as much 


as possible. Theorem 2.1 "c) implies that, even if K is not small, as long as /i*.(E) decreases 
fast enough, XB can be well approximated by the first few components. We call this the 
Szgnal Extraction multivariate /degression (SiER) method. 

To estimate a. kl we estimate B by 


B = -ix T (Y - l n y T )(Y - l n y T ) T X , (2.13) 

n z 

where y is the sample mean of yi, • • • , y n , and l n is an n-dimensional vector with all elements 
equal to one. In the classic setting of small p and large n, the estimates Si, • • •, olk can be 
sequentially obtained by solving 


max a T BcK, subject to a T Sa = 1, otTSa. = 0, 1 < l < k — 1, 

cteRP 


(2.14) 


where 1 < k < K. Once we obtain ct kl we have t k = XS^, which has zero sample mean. 


With the constraints in (2.14), tq, t 2 , ..., tx are orthogonal and satisfy ||tfc|||/rz = 1. Let 
T = [t'i, ■ • • , t k]- Then T X T = nl^, where I k is the K -dimensional identity matrix. The 
matrix W is estimated by regressing Y on ti, to,..., t k with the usual least squares method. 
That is, 

1 . 


W 1 = (T i T) _ 1 T i (Y — lny 1 ) — —T 1 (Y — l ra y), 


n 


and B = Siw^ + S 2 wJ + • • • + 


(2.15) 


is the estimate of B, and Bk = ci-iwj + S 2 wJ H-h is the estimate of Bk . Due to the 

orthogonality of t l5 1 2 , ..., t K , B k does not depend on the number of selected components in 
practice. The following lemma shows that in the special case of scalar response (q — 1), our 


estimate B in (2.15) is equivalent to the least squares method. But if q > 1, this method 
may not be the same as the least squares method. 


Lemma 1. Suppose that S is full rank. If q 
as the least squares estimate. 


1, the estimate in (2.15) is exactly the same 


We will introduce the sparse method for the high-dimensional setting in the next section. 
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Sparse estimates and oracle inequalities in high¬ 
dimensional settings 


3.1 Sparse estimates 


We make the following sparsity assumption: only a small number of the coefficient vectors, 
/3 1; • • • , /3 p , are nonzero vectors. Since these vectors are the row vectors of £>, this assumption 


is just the row-wise sparsity of B. The definition (2.6) implies that a./- is a sparse vector and 


the number of its nonzero coordinates is less than or equal to the number of nonzero vectors 
among /3 1 , ■ ■ ■ ,/3 p . Motivated by the sparsity of a we propose the following penalized 
optimization problem whose solution is the sparse estimate a /. of ak: 


a 1 Bo 


max 


cteRp aTSa: + 7 “||ck||^ ’ 
subject to ra T Sa: = 1, a^Sa = 0, 1 < l < k — 1, 


(3.1) 


where ||ck||| = (1 — A)||ck||| + A||a||^, and both r > 0 and 0 < A < 1 are tuning parameters. 
In the penalty r||a||^, the I 2 term is used to overcome the singularity problem of S and the 


li term encourages the sparsity of The penalty r||a:||| was introduced in Qi, Luo and 


Zhao (2013) for sparse principal component analysis and utilized in Qi, Luo, Carroll and 


Zhao (2015) for sparse regression and sparse discriminant analysis. The main reason that 


we use the squared l± norm instead of the l\ norm itself as in the elastic-net is to make the 


objective function in (3.1) scale-invariant, that is, if we replace « by tct, where t is any 


nonzero number, the value of the objective function is unchanged. This property plays an 
important role in our theoretical development and algorithms. Due to the scale-invariant 


property, (3.1) is equivalent to 


liiaxadBa: subject to ck t Sq: + r||o:||^ < 1 and a 1 = 0, 1 < l < k — 1. (3.2) 


In fact, the solutions of (3.1) and (3.2) differ only by a scale factor. We have proposed 


algorithms to solve a more general optimization problem (see the problem (4.3) in Qi et al. 


(2015)) than (3.2). By a proper rescaling of the solution to (3.2), we obtain a^. With the 


constraints in (3.1), the estimates = Xq^, 1 < k < K, are still orthogonal to each other 
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and satisfy ||tfc|||/n = 1. Therefore, we can use (2.15) to get the estimates W and B. In 


the special case of scalar response, the proposed method is just the sparse regression by 


projection method proposed in Qi et al. (2015) 


3.2 Oracle inequalities 

In this section, we provide oracle inequalities for the estimates of a*,, w*, and B in high¬ 
dimensional settings. These oracle inequalities hold for any n and p. 


We follow the notations in Bickel et al. (2009). For any p-dimensional vector a = 
(a!,--* , a p ) T , let J(a) = {j G {1, • • ■ ,p} : aj ^ 0} denote the collection of indices of 
nonzero coordinates of a and At (a) = |./(a)| denote the number of nonzero coordinates of a, 
where | J(a)| is the cardinality of J(a). At (a) is a measure of the sparsity of a. Similarly, for 
any p x q matrix B with row vectors, f3 1 , • • •, /3 , we define J(B) — {j G {1, ■ ■ • , p} : /3 ■ ^ 0 } 
and M(B) = \J{B)\ which is a measure of the row-wise sparsity of B. It follows from 
that 


J(cx-k) C J(B), and hence At (a*,) < A t(B), VI < k < K. 


(3.3) 


Before we provide the main results, we first show that B k has nearly the smallest expected 
prediction error among all rank k coefficient matrix estimations when n is large. Let x liew 
be a new observation of the predictor vector and S the covariance matrix of x new . The 


corresponding new response is (y 


new\T 


= X 


new\T 


) T B + (e new ) T , where e new is independent of 


Theorem 3.1. Suppose that ||S — £||oo < Cwhere C is a constant which does not 
depend on n and p. Then we have 


E | ||fx new ) T £, - (y new ) T ||l] < min E ||(x new ) T ^ - (y new ) T ||l + 2C\\B\\ 2 F M(B)< ,l ° gP 


B k 


n 


where the minimum is taken over all possible B k of the forms Xq=i bi v 7 with arbitrary 


bj G and Vj G M 9 . 


Note that when q — 1 , Af (B)y/\ogp/n is the convergence rate of the LASSO and the 


Dantzig selector (Bickel et al., 2009). Under the sparsity assumption that when n, p —> oo, 
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\\B\\ 2 F M(B)^/logp/n —> 0, the expected prediction error of Bk is close to the smallest one 
among all possible rank k approximation to B when n and p are large. We assume that 
< C ^/logp/n, because it has been shown (Equation (A14) 


IS-XI 


m 


Bickel and Levina 


(2008)) that -y/log p/n is the order of the max norm of the difference between the sample 
covariance matrix and the population covariance matrix of p-dimensional multivariate normal 
distribution. 

Now we state three regularity conditions for the main theorems. In the setting of large 


p and small n, the identification problem exists for the model (|2.1|). That is, there exists 
B 7 ^ B such that X£> = X£>. 


Bickel et al. 


(2009) imposed the following restricted eigenvalue 


assumptions on X, which we will also adopt here. 
Condition 1. Let s = M.(B) and 


|Xd| 


k = mm 

JoC{l,- ,p}, 

| Jo|<S 


mm 

0 ^< 5 eR p , 

l^ J 0 111— C ll^ J 0 I 




where c > 1 and k > 0 are two constants, S j 0 and Sjc are the subvectors of S consisted of 
the coordinates with indices belonging to Jo and respectively. 

A consequence of this restricted eigenvalue assumption is that for any two p-dimensional 
vectors, a and ex.' with sparsity JA(ot) < s and At (a') < s, if Xa = Xa', then we have 


ol = ol' (see the second remark after Theorem 7.3 in Bickel et al. (2009)). Therefore, the 


model ( 2 . 1 ) is identifiable among all coefficient matrices with row-wise sparsity less than or 


equal to M{B). Therefore, if XB' = XB and M(B') < s = M(B), then B' = B. 

The next regularity condition is on the distribution of the (/-dimensional noise vector £ t . 


Condition 2. The random error vectors Si, 1 < i < n, are i.i.d. mean zero M 9 -valued Gaus¬ 
sian variables with median M e and variance a 2 , where M e is defined as the median of the real¬ 
valued random variable ||Sj|| 2 and the variance is defined as a 2 = sup ugK9 || u ||, ;=1 E , [(u T £ i ) 2 ] 


(Section 3.1 in Ledoux and Talagrand (2011)). 


Note that (u T £;) is the projection of onto the direction of u, and has a normal distri¬ 
bution. Hence, a 2 is the maximum of the variances of the projections of Si along all possible 
directions in M 9 . In the special case q = 1, a 2 is the just usual variance. In our theoretical 
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development, we need to estimate the tail probabilities of ||Sj || 2 which can be controlled by 


M e and a 2 (Section 3.1 in Ledoux and Talagrand (2011)). Bunea et al. (2012) assumes that 
the coordinates {e^ : 1 < j < q} of £ t have independent and identical normal distributions. 
Condition [2] is weaker and allows correlations among £ il; ..., e iq . 

Condition 3. All the diagonal elements of S = X T X/n are equal to 1 and there exist positive 
constants, c 2 and c 3 , such that 


(a). min( Ml(g) 7 ^ 2(s) , w( B) ~y B) • • , ^-i(S)-vk(s) | > 

1 J f H l(=) ’ /«(=) ’ ’ /Uc-l(=0 \ ~ 


(b). M s )/W(3) < c 3 . 


Bickcl et al. (2009) assumed that the diagonal elements of S = X T X/n are equal to 1, 


which can be achieved by scaling X. Condition [3] (a) prevents the cases where the spacing 
between adjacent eigenvalues is so small that the eigenvalues cannot be well separated based 
on noisy observations. Condition [3] (b) excludes the situations where the magnitudes of the 
higher order components are too small compared to those of lower order components. 

As mentioned in Section [ 2 j in the classic setting, if q > 1 , our method (2.13) may not be 
the same as the least squares method. For the integrity of this paper, below we first provide 


the asymptotic result for our method in Theorem 3.2 under the setting when p is fixed and 
n goes to infinity, and then provide the theoretical property of our sparse estimates for high 
dimensional setting afterward. 


Theorem 3.2. Suppose that Conditions hold with M s , a 2 and C 3 bounded above and c 2 
bounded below as n —> 00 . Moreover, we assume that there exists c 0 > 0 independent of 
n such that S is positive definite and its smallest eigenvalue, A m j n (S) > Co, for all n large 
enough. Then we have 


\\B - B\\ F = O p {l/^n), ||cKfc — CKfclb = 1/2 O p {l/Vn), 


(3.4) 


for any 1 < k < K. 

I 11 the rest of this section, we study the theoretical property of our sparse estimates 
ct k and B when both n,p —> 00 . We first provide upper bounds on the l k sparsity of 
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6tk, 1 < k < K, and the oracle inequalities for them in the following theorem. Although 


we use the same tuning parameters (r, A) in (3.1) for all components in practice for the 
computational efficiency, in our theoretical results, we allow different tuning parameters 
for different components. We use [r^ k \ \^ k >) to denote the tuning parameters for the k- th 
component, 1 < k < K. Let 7 fc = Z= t k/y/n, 1 < k < K, which are estimates of 
7 fc = ZcKfc = t k/\/n. Define 


zu = 


logp 


n 


and recall that s = 


Theorem 3.3. Assume that Conditions 


hold. Suppose that 


PifE) A h 2 Clw 2 s/ k 2 , where C 0 = 2 max{M e /\/log 2 , 2a}, 


(3.5) 


k is the constant in Condition^ and h is a constant. Let the tuning parameters (r^ k \ A^), 
1 < k < K, satisfy conditions 


(*) = 

“ IMhv/^p)’ 


c - 1 + 4 < \ {k) < l, 


(3.6) 


where A^ and S 0 are positive constants such that c 1 + h 0 < 1; an d c is the constant in 
Condition Q] 


(a). For the first component (k = 1), there exist constants Af and ho which only depend on 
c, C 2 and <5o, where c 2 is the constant in Condition [3] (a), such that with probability at 
least 1 — 2e M l/ 2cr “p 1_c 'o/ 4o ' 2 ; if A^ > Af and h > h 0 , we have 

IIS1II1 < a/DcH c»(i|| 1 < VGcs/k, ( 3 . 7 ) 

11! — CK! || ! < 4(1 + c)(l + V0c)cf l A^CqK^ 2 

-||X(Si - c»£i)Hi = ||7! - 7J 2 < 16 c7 2 (1 + a/6c) 2 (A^ 1) C 0 / k) 2 pifS)~ 1 w 2 s. 

n 


(b). For the higher order components (1 < k < K), we further assume that 

max | ctk || 1 < C4 min ||aJ|i, k ~ 2 < C5 1 |oti|| 1, ( 3 . 8 ) 

1 <k<K " 1 <k<K " 

where c 4 and c 5 are two constants. Then there exist constants h 0 , Aj < A^, 1 < 
j < K, which only depend on So, c, c 2 ~ c 5 , such that with probability at least 1 — 
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2 gM|/ 2 o- 2 p i cg/ 4 o- 2 ^ j or an y i < k < K, if Aj < < AV, 1 < j < k, A^ > Af and 

h> h 0 , we have 

||a*||i < Dk,iy/s/n, ||a fc - a fc ||i < D kA A ik) C 0 K~ 2 /ii(H) _1/2 ws,, (3.9) 

^l|X(a fc - Q£*)||i = ||7fc - 'Yfclll < D k:2 (A {k) C 0 /k) 2 /i 1 {S)~ 1 w 2 s , 
where D k , 1 , D k ,2 and D kj4 are constants only depending on 5 q, c, c 2 ~ C 5 . 


In the upper bounds above, only C 0 depends on the distribution of the noise vector 
£j and can be regarded as a measure of the magnitude of noise. Although we have two 


tuning parameters, rt fc ) and for each k, by Theorem 3.3 A^ is not essential for the 
convergence rates. Actually, it can be any number in a subinterval of the interval (c - 1 ,1] and 
does not affect the convergence rates. However, the choice of A^ does affect the predictive 
performance in the finite-sampling situations. We will propose methods to choose these two 
parameters in the following section. Now we provide the oracle inequalities for W, B and 
X£>. We will consider the case q — 1 first and then q > 1. When q = 1, we use /3 to denote 
the coefficient vector B and W = wj is a scalar. 

Theorem 3.4. Let q = 1 and e* ~ N(0,a 2 ). Suppose that Condition^holds and ||X/3|||/n > 
K 2 a 2 w 2 s/ k 2 , where k is the constant in Condition^ 7} Let the tuning parameters (r, A) satisfy 


Aaw 


t = 


c 1 T ^ — 1 ) 


I ^ 1 1 1 1 11 X/31 1 2 / yfn 


where A is a constant large enough and <5 0 is a constant satisfying c 1 + <5 0 < 1. Then with 
probability at least 1 — 2 \/ep~ 3 , we have 

|w x - Wi| < DiaK^wy/s, ||/3 — /3||i < D 3 aK~ 2 zus, ||X(/3 — /3)H 2 < y/nD^an^w^fs, 


where D 1; D 3 and D,± are constants only depending on A, c, h and c 2 
The upper bounds of ||/3 — /3||i and ||X(/3 — /3 )||2 in Theorem 


3.4 


are the same as those 


for the Lasso and the Dantzig selector (Bickel et al, 2009) except the constants. 

When q > 1, to measure the difference between B and B, we consider the L X 2 norm 
which is defined by 

1 


|M||i2 — 


1 1/2 


E Ew 

j =1 \*=1 


ij\ 
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where M is any l x m matrix and M t] is the (i,j) element. Then for any matrix M, we have 
||M||i , 2 > ||M||i?. Therefore, convergence under the norm is stronger than that under 
the Frobenius norm. If m = 1, L 12 norm is just the l\ norm for a vector. 


Theorem 3.5. Suppose that all the conditions in Theorem 3.S j hold. Then with probability 
at least 1 — 2e M ^ 2a ~ J p l ~ c oA a ~ ; for any 1 < K 0 < K, if A% < < A%, 1 < k < K 0 , and 


A( r o) > a l Kq , we have 


||w fc - w fc || 2 < L kt iC 0 w^fs/ k, || B k - B k \\l,2 < L KQ) 2,Cows/ k 2 , 

||XB fc — XiBfcHir < \fnLK 0 ,iCow^/s/ k , 


for 1 < k < Kq, where L kl \, L k $ and L k ^ are constants only depending on A^ for 1 < j < k, 
c, h and c 2 ~ C5. In particular, when K 0 = Ii, we have 

||B - 23|| 1,2 < L K ^C a ws/ k 2 , ||X(/5 - i3)||F < \fnL K ^C^w\fsfn. 

provides an upper bound on i?[||X(£> — B)\\ 2 F /n\, which is {qK/n + 
Kszu 2 } multiplied by a constant (they use different notations). This upper bound depends 
on q, the dimension of Y, and only if q/n —> 0, the upper bound converges to zero. Our 
bound holds for arbitrary q, even if q goes to infinity faster than n and p. 

For any 1 < k < K, B k and X/5^ are estimates of B k and X£y,, respectively. Theorem 


enough, X£> can be well approximated Xi3fc with a relatively small k. 


3.5 and Theorem 2.1|(c) imply that, even if K is not small, as long as /ifc(E) decreases fast 


Bunea et al. 


(2012 


4 Choice of the number of components and tuning pa¬ 
rameters 

In this section, we propose a method to choose the tuning parameters and decide the number 
of components. We first provide the rationale behind the selection method and then provide 
the details of the method in Algorithm |4.1| In practice, to improve computational efficiency, 
we use the same tuning parameters for all components and denote them as (r, A). The 
theoretical results in previous section imply that A is not essential for the convergence rates. 
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However, it has effects on the prediction errors in the finite sample situations. With the 
penalty r||o:||| = r(l — A) ||ck || 2 +'rA|| o: || f, the coefficient of the squared l\ term is r A. Roughly 
speaking, the effect of (r, A) on the sparsity of solutions mainly depends on r A and thus a 
small r with a large A has a similar effect on the sparsity of solutions as that of a large r with 
a small A. Hence, to improve the computational efficiency, we do not consider all the pairs 
of (r, A) in a two dimensional grid. Instead, we will select the parameters from a sequence of 
pairs where with the increase of r, A also increases. Specifically, in the following simulation 
studies and applications, we choose (r, A) from the following paired values: (0.05,0.05), 
(0.1,0.05), (0.1,0.1), (0.5,0.1), (0.5,0.2), (1,0.2), (1,0.3), (5,0.3), (5,0.4), (10,0.4), (50,0.5), 
and (100, 0.6). 

For each of the 12 pairs of tuning parameters, we first determine K{, 1 < i < 12, the 
maximum number of components to be found. As Hk{ 3) = otf'Botk measures the signal 
magnitude of the k-th component and /4(S) = a^Ba*, is an estimate of /^(E), we only 
compute the first few components with large values of /4(E) and stop when /4(E) becomes 


small enough. On the other hand, by Theorem 2.1 (a), the number of components cannot 
exceed min (n,p,q). Based on these two considerations, we define 


K{ = minj/l^, K^}, where K\ L> = min (n,p,q), 


(i) 


(4.1) 


and I<f } = min lk> 1 : 




hl(“) + ‘ ‘ ‘ + hfc(^) 


< 0.05 


( 2 ) • • 

The K\ implies that we stop searching for higher order components if the signal magnitude 
of the kill component does not account for more than 5% (by default) of the first k compo¬ 
nents. Once we have determined all the AA, we use the cross-validation method to determine 
the tuning parameters and the optimal number of components, K opt . More specifically, we 
summarize the procedure in the following algorithm. 


Algorithm 4.1. 1. For the i-th paired value of the tuning parameters, 1 < i < 12, we 


determine K t using (4.1) and the whole data set. 


2. Use the five-fold cross-validation to determine the number of components and the tuning 
parameters. We split the whole data set into five subsets and repeat the following 
procedure. For 1 < l < 5, we use the l-th subset as the l-th validation set and all other 
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observations as the l-th training set. Then for the i-th pair of tuning parameter values, 
based on the l-th training set, 


(a) we estimate the first K, components (op,..., &%■) sequentially solving (3.1). 

(b) For each j — 1, • • • , K t , we define tj = Xaq as the j-th new predictor. We use the 
first j new predictors and ( |2.15[ ) to get the estimate B\- of the coefficient matrix, 
which is the estimate based on the l-th training data set, the i-th pair of tuning 
parameter values and the first j components. 


(c) Then we apply Bf- to the l-th validation data set to get the validation error, e®. 

(d) Finally, we calculate the average validation error, = (e^-f-fe^)/5, for the 

i-th pair of tuning parameter values and the first j components (j = 1, • • • , K,). 


Let e io j 0 = minjj e ir Then the i 0 -th paired value of tuning parameters is chosen, and 
the optimal number of components is K opt = j 0 . 


5 Simulation studies 


In this section, we compare the performance of the proposed SiER method with four related 


methods on simulated data. The first method is the SRRR (Chen and Huang, 2012) which 


assumes the reduced rank structure and estimates the lower rank decomposition of the 
coefficient matrix by solving a penalized least squares problem with a group-Lasso type 


penalty on the first lower rank matrix. The second method is RemMap (Peng et al 


2010 ) which does not assume the reduced rank structure and solves a penalized least squares 


problem with both row-wise and element-wise sparsity imposed on the coefficient matrix. The 


third method is the SPLS (Chun and Keles, 2010) which identifies sparse latent components 


by maximizing the covariance between them and the responses with sparsity penalty imposed. 
The last method is SepLasso which fits separate regression models using Lasso for each 
individual response. 

We consider three cases. In each case, we will consider the effects of different factors 
including the dimension of the responses (g), the number of predictors ( p ) and their cor¬ 
relations (p), and the magnitude (a 2 ) and correlation (r) of noises. In the first case, q is 
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small. In the last two cases, q is relatively large and we generate the coefficient matrices as 
the product of two lower rank matrices. For each setting, we repeat the following procedure 
50 times. In each replicate, we simulate 590 independent observations among which 90 are 
the training data and 500 are the test data. Then we apply each of the five methods to the 
training data to select the tuning parameters and the number of components, and construct 
the final model which is applied to the test data to obtain the test errors. The test error 
is defined as ||Y tesf — Y p? . erf |||,/500, where Y test is the 500 x q matrix of the test data and 
Y pre d is the corresponding predicted matrix. The mean squared prediction error (MSPE) is 
obtained by averaging the 50 test errors. 

5.1 Case 1 

We generate data using the model Y = XB + e. We fix p = 500, q = 3, and set B, } \ = l/\/l5 
for j = 1 ,..., 15, B j2 = 0.5/x/30 for j = 16,..., 45, B j3 = 0.25/^60 for j = 46,..., 105, 
and Bjk = 0 for others. For each i = 1 ,n, the first 150 predictors are generated from 
multivariate normal distribution (X a ,..., A/ 150 ) T ~ i\r 150 (0,E), where E has the 
th element E jf. = p ^~ k I for j, k = 1,..., 150, and the other predictors are independent 
normal variables, X t j ~ iV(0, 0.1 2 ) for j = 151,... ,p. The noise vector e* is generated from 
N q ( 0, <j 2 R ), where the correlation matrix R has diagonal elements 1 and off-diagonal elements 
r. In this and the following cases, we will use gcr 2 , the sum of the diagonal elements of the 
covariance matrix of the noise vector e*, as a measure of the noise level. We choose p = 0.3 
and 0.7, qc r 2 = 0.1 and 0.15, r = 0.2 and 0.9, and consider all their possible combinations. 

We draw the boxplots of MSPE in Figure [2] and show the average and standard deviation 
of MSPEs of 50 repeats in Table [l] The SiER has best predictive performance. In this 
case, q is small, and there is no factor structure and no obvious group sparsity structure, 
so it is understandable that the SRRR has larger prediction errors than RemMap since the 
latter takes advantages of both the row-wise and element-wise sparsity. The SPLS has the 
largest MSPE because it does not directly target on prediction of the responses and thus has 
disadvantage in terms of prediction. The correlation among predictors, p, has a large effect 
on MSPE for all methods. A stronger correlation leads to a smaller prediction error in all 
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the methods. The correlation in random error, r, does not have obvious effects on MSPE 
for all methods. Increasing in the magnitude of random error ( qa 2 ) unsurprisingly leads to 
higher MSPE. 



Figure 2: Boxplots of MSPE for different methods with different parameter settings 

(p,r,qa 2 ) in Case 1 of the simulation studies. 


5.2 Case 2 

In this case and Case 3, we generate the coefficient matrix as the product of two lower 
rank matrices, B = CD, where C is a p x K matrix and D is K x q. In this case, we fix 
K = 3 and choose (p, q) = (100, 20) or (150, 30). For the matrix C, each element in the first 
p 0 = 40 rows is independently generated from N( 0,1), the rest p — po rows are set to be zero. 
Then each column of C is scaled to unit norm. All elements in D are first independently 
generated from the uniform distribution between —1 and 1, and then each row of D is 
scaled to unit norm. The first 50 predictors in X are generated from A^ 50 (0,E), where E 
has diagonal elements 1 and off-diagonal elements p. All the other predictors are generated 
independently from N( 0, 0.1 2 ). The noise vector is generated from N q ( 0, cr 2 R), where R has 
diagonal elements 1 and off-diagonal elements r. We choose p = 0.3 and 0.7, qa 2 = 0.015 
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Tabic 1: The averages and standard deviations (in parenthesis) of the MSPEs of 50 replicates 
for Case 1 in the simulation studies. 



P 

r 

SiER 

SRRR 

RemMap 

SPLS 

SepLasso 

0.1 

0.3 

0.2 

0.295(0.039) 

0.395(0.050) 

0.336(0.067) 

0.848(0.180) 

0.504(0.087) 

0.9 

0.320(0.053) 

0.395(0.055) 

0.322(0.060) 

0.824(0.181) 

0.489(0.088) 


0.7 

0.2 

0.214(0.023) 

0.289(0.031) 

0.244(0.051) 

0.432(0.079) 

0.289(0.049) 


0.9 

0.217(0.028) 

0.282(0.030) 

0.243(0.050) 

0.430(0.075) 

0.282(0.047) 

0.15 

0.3 

0.2 

0.429(0.045) 

0.536(0.049) 

0.444(0.063) 

0.918(0.137) 

0.630(0.089) 

0.9 

0.427(0.051) 

0.501(0.045) 

0.440(0.068) 

0.923(0.202) 

0.597(0.095) 


0.7 

0.2 

0.283(0.028) 

0.376(0.037) 

0.323(0.044) 

0.475(0.079) 

0.377(0.053) 


0.9 

0.302(0.033) 

0.394(0.046) 

0.368(0.090) 

0.508(0.089) 

0.392(0.053) 


and 0.030, r = 0 and 0.5, and consider their all possible combinations. We show the average 
and standard deviation of MSPE in Table [2] The SiER has the smallest prediction errors in 
all the situations. The SRRR has a better prediction performance than RemMap because 
in this case, there are obvious reduced rank structure and row-wise sparsity structure. The 
MSPE of the first three methods are mostly affected by the magnitude of random error ( qa 2 ), 
and are less sensitive to the correlations of predictors and random errors, and the dimensions 
p and q. 

We also compare the dimension reduction and feature selection of all methods. The 
number of selected components, the sensitivity and specificity of variable selection are sum¬ 
marized in Table [3} Only three methods, the SiER, the SRRR and the SPLS, generate low 
rank latent components. In all settings, the SiER chooses the smallest number of components 
and is most efficient in dimension reduction. The SiER, SRRR and RemMap have sensitivity 
equal to one in all settings, that is, these three methods select all the true features. The 
SRRR has the highest specificity, and hence performs best in feature selection in this case. 
The SiER tends to select more features than the SRRR, the RemMap and the SPLS in this 
simulation setting. 
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Table 2: The averages and standard deviations (in parenthesis) of the MSPE for Case 2 in 
the simulation studies. 


(p,q) 

qa 2 

P 

r 

SiER 

SRRR 

RemMap 

SPLS 

SepLasso 

(100,20) 

0.015 

0.3 

0 

0.020(0.009) 

0.026(0.003) 

0.038(0.005) 

0.835(0.170) 

0.175(0.054) 

0.5 

0.020(0.005) 

0.025(0.003) 

0.037(0.005) 

0.839(0.168) 

0.167(0.049) 

0.7 

0 

0.019(0.001) 

0.035(0.005) 

0.043(0.004) 

0.473(0.109) 

0.268(0.057) 

0.5 

0.019(0.002) 

0.034(0.003) 

0.041(0.004) 

0.468(0.096) 

0.272(0.075) 

0.030 

0.3 

0 

0.037(0.001) 

0.043(0.003) 

0.063(0.001) 

0.894(0.167) 

0.241(0.066) 

0.5 

0.040(0.017) 

0.044(0.004) 

0.070(0.010) 

0.823(0.138) 

0.230(0.056) 

0.7 

0 

0.038(0.002) 

0.051(0.005) 

0.069(0.005) 

0.523(0.135) 

0.306(0.068) 

0.5 

0.037(0.003) 

0.049(0.005) 

0.069(0.007) 

0.527(0.114) 

0.298(0.060) 

(150,30) 

0.015 

0.3 

0 

0.018(0.001) 

0.025(0.003) 

0.040(0.004) 

1.015(0.172) 

0.475(0.151) 

0.5 

0.021(0.009) 

0.025(0.003) 

0.040(0.004) 

0.953(0.165) 

0.446(0.129) 

0.7 

0 

0.018(0.001) 

0.033(0.004) 

0.047(0.005) 

0.599(0.117) 

0.558(0.101) 

0.5 

0.018(0.001) 

0.032(0.003) 

0.046(0.005) 

0.603(0.110) 

0.575(0.094) 

0.030 

0.3 

0 

0.035(0.001) 

0.042(0.003) 

0.070(0.005) 

0.972(0.185) 

0.501(0.104) 

0.5 

0.036(0.002) 

0.042(0.004) 

0.073(0.011) 

1.023(0.183) 

0.581(0.205) 

0.7 

0 

0.036(0.001) 

0.050(0.005) 

0.073(0.006) 

0.625(0.164) 

0.610(0.098) 

0.5 

0.036(0.002) 

0.049(0.005) 

0.073(0.008) 

0.628(0.171) 

0.597(0.094) 


5.3 Case 3 

In this case, we still generate the coefficient matrix by B — CD. We increase the dimensions 
p and q , and introduce correlation among the coordinates of the row vectors of D. We take 
K — 3, p — 1000 or 2000, q = 100 or 200. The matrix C is generated in the same way as in 
Case 2 with p 0 = 100. Each row of D is first independently generated from the multivariate 
normal distribution N g (0, and then scaled to have unit norm. The (i, j)-th element in E^ 
is defined as exp[— (|i — j|/100) 7 ] for i, j = 1,..., q. Therefore, the row vectors of D can be 
viewed as a discretely observed Gaussian process at time points 1/100, 2/100, • • •. We choose 
7 = 1 or 2 to represent different correlation levels, where 7 = 2 leads to a smoother Gaussian 
process than 7 = 1. The first 200 predictors in X are generated from lV20o(0 >S)> with the 
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Table 3: The average number (K) of components selected for the SiER, SRRR and SPLS, 
and the sensitivity (Se) and specificity (Sp) of all methods in case 2 in the simulation studies. 


(p, q) 

qa 2 

P 

r 

SiER 

SRRR 

RemMap 

SPLS 

SepLasso 

K 

Se 

Sp 

K 

Se 

Sp 

K 

Se 

Sp 

k 

Se 

Sp 

k 

Se 

Sp 

(100,20) 

0.015 

0.3 

0 

3 

1 

0.58 

3 

1 

0.92 

- 

1 

0.83 

5 

0.77 

0.84 

- 

0.76 

0.25 

0.5 

3 

1 

0.60 

3 

1 

0.93 

- 

1 

0.83 

5 

0.76 

0.84 

- 

0.80 

0.21 

0.7 

0 

3 

1 

0.33 

3.22 

1 

0.94 

- 

1 

0.83 

5 

0.94 

0.77 

- 

0.20 

0.80 

0.5 

3 

1 

0.39 

3.02 

1 

0.94 

- 

1 

0.83 

5 

0.92 

0.77 

- 

0.49 

0.53 

0.030 

0.3 

0 

3 

1 

0.59 

3 

1 

0.90 

- 

1 

0.83 

5 

0.78 

0.80 

- 

1.00 

0.00 

0.5 

3 

1 

0.54 

3 

1 

0.92 

- 

1 

0.83 

5 

0.75 

0.86 

- 

1.00 

0.01 

0.7 

0 

3 

1 

0.47 

3.06 

1 

0.93 

- 

1 

0.81 

4.92 

0.94 

0.74 

- 

1.00 

0.02 

0.5 

3 

1 

0.48 

3.02 

1 

0.96 

- 

1 

0.82 

5 

0.93 

0.75 

- 

1.00 

0.02 

(150,30) 

0.015 

0.3 

0 

3 

1 

0.69 

3 

1 

0.96 

- 

1 

0.91 

5 

0.69 

0.90 

- 

1.00 

0.04 

0.5 

3 

1 

0.71 

3 

1 

0.96 

- 

1 

0.91 

4.98 

0.70 

0.89 

- 

1.00 

0.04 

0.7 

0 

3 

1 

0.50 

3.19 

1 

0.97 

- 

1 

0.91 

4.85 

0.86 

0.86 

- 

0.93 

0.12 

0.5 

3 

1 

0.57 

3 

1 

0.97 

- 

1 

0.91 

4.85 

0.83 

0.87 

- 

0.92 

0.14 

0.030 

0.3 

0 

3 

1 

0.75 

3 

1 

0.95 

- 

1 

0.91 

4.98 

0.65 

0.91 

- 

1.00 

0.03 

0.5 

3 

1 

0.71 

3 

1 

0.95 

- 

1 

0.91 

4.98 

0.73 

0.89 

- 

1.00 

0.05 

0.7 

0 

3 

1 

0.54 

3.04 

1 

0.96 

- 

1 

0.90 

4.70 

0.89 

0.86 

- 

0.92 

0.14 

0.5 

3 

1 

0.58 

3 

1 

0.96 

- 

1 

0.90 

4.77 

0.80 

0.86 

- 

0.92 

0.12 


(i, j)-th element equal p^ l ~^ (i, j — 1,..., 200). The rest p — 200 predictors are independently 
generated from iV(0,0.1 2 ). The random noise vector is generated from N q (0, cr 2 R), where R 
have diagonal elements 1 and off-diagonal elements 0.5. We fix qcr 2 = 0.15. We summarize 
the MSPE in Table [4], the dimension reduction and the sensitivity and specificity of variable 
selection in Table [5] In this case, the SRRR is not included because the heavy computation 
load makes it unavailable. As the noise level is fixed, the MSPE is not sensitive to 7 , p 
and q. The SiER has the lowest average MSPE among the four methods and choose less 
components than the SPLS. In this large dimension case, the SiER has both high sensitivity 
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and high specificity. 


Table 4: The averages and standard deviations (in parenthesis) of the MSPE for case 3. 


q 

p 

7 

SiER 

RemMap 

SPLS 

SepLasso 


1000 

1 

1.739(0.185) 

2.146(0.480) 

2.816(0.370) 

3.103(0.340) 

100 

2 

1.844(0.330) 

2.264(0.615) 

2.872(0.462) 

3.117(0.501) 

2000 

1 

1.748(0.244) 

2.108(0.485) 

3.026(0.432) 

3.202(0.408) 


2 

1.740(0.337) 

2.185(0.616) 

2.944(0.396) 

3.113(0.385) 


1000 

1 

1.763(0.208) 

2.180(0.419) 

2.765(0.235) 

3.131(0.488) 

200 

2 

1.774(0.304) 

2.186(0.574) 

2.843(0.320) 

3.070(0.321) 

2000 

1 

1.756(0.224) 

2.109(0.488) 

2.960(0.279) 

3.098(0.266) 


2 

1.808(0.280) 

2.243(0.611) 

3.004(0.360) 

3.161(0.354) 


6 Application to the communities and crime data 


The data set (Bache and Lichman 2013) contains the socio-economic information about a 


large number of communities over the United States and the corresponding crime data. The 
goal is to predict different types of crimes of a community based on various socio-economic 
variables of the community such as the median family income, per capita number of police 
officers, and so on. We first remove the variables with a large proportion of missing. We also 
remove the response (crime) variables with too many zero values as it is not appropriate to 
treat them as continuous variables. Due to the high skewness of the variables, we use the 
logarithms of each of the original variables plus one as the new variables. After excluding 
samples with extreme values, we get 1514 samples and 102 predictors. The response variables 
are the transformed number of Assault, Burglary, Larceny, Auto Theft, and Arson, the 
transformed total number of violent crimes, and the transformed total number of non-violent 
crimes, per 100,000 population. 

To evaluate the prediction performance and the selection of components and features, we 
repeat the following procedure 100 times: in each replicate, we randomly take 150 samples 
as training data, the remaining as test data, and apply all the five methods to the train data 
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Table 5: The average number (K) of components selected for the SiER and SPLS, and the 
sensitivity (Se) and specificity (Sp) of all methods in Case 3. 


Q 

P 

7 

SiER 

RemMap 

SPLS 

SepLasso 




K 

Se 

Sp 

K 

Se 

Sp 

K 

Se 

Sp 

K 

Se 

Sp 


1000 

1 

2.86 

0.84 

0.93 

- 

0.77 

0.93 

3.44 

0.45 

0.77 

- 

0.42 

0.86 

100 

2 

2.32 

0.75 

0.94 

- 

0.65 

0.94 

2.76 

0.41 

0.79 

- 

0.32 

0.89 

2000 

1 

3 

0.82 

0.97 

- 

0.77 

0.97 

3.06 

0.35 

0.84 

- 

0.27 

0.95 


2 

2.40 

0.76 

0.97 

- 

0.65 

0.97 

2.38 

0.30 

0.85 

- 

0.23 

0.95 


1000 

1 

3.04 

0.88 

0.92 

- 

0.79 

0.93 

3.54 

0.43 

0.81 

- 

0.51 

0.80 

200 

2 

2.34 

0.78 

0.93 

- 

0.71 

0.93 

2.70 

0.35 

0.83 

- 

0.39 

0.85 

2000 

1 

2.96 

0.85 

0.97 

- 

0.81 

0.96 

3.26 

0.35 

0.83 

- 

0.37 

0.91 


2 

2.66 

0.77 

0.97 

- 

0.68 

0.97 

2.50 

0.32 

0.86 

- 

0.27 

0.93 


to build predictive models and obtain the MSPE by applying the model to the test data. 
The boxplots of prediction errors of all methods are shown in Figure [3j The mean MSPEs 
of the SiER and SPLS are close and smaller than others. Table [6] shows the frequency of the 
number of selected components and the mean and standard deviation of number of selected 
features for each method. The SiER selects two components in 76% of the 100 replicates and 
three components in all the other replicates. Both the SRRR and the SPLS tend to select 
more components and they select five components with the highest frequency. 

Among the 100 replicates, 11 features are selected by the SiER over 90 times, such as the 
percentage of population who are divorced, the number of people living in urban areas, the 
percent of persons in dense housing, the number of kids born to never married, and so on. 
The histograms of nonzero coefficients of these variables for assault are shown in Figure |4j 
It can be seen that the percentage of kids in family housing with two parents (pctKids2Par) 
and the percentage of families with kids that are headed by two parents (pctFam2Par) have 
protective effects (with negative coefficients) in crime assault, and the other variables have 
positive associations with the number of assault. For other crime types, pctKids2Par and 
pctFam2Par also have protective effects. Percentages for divorce and dense housing, and the 
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SiER SRRR Rem Map SPLS LASSO 


Figure 3: Boxplots of prediction errors for different methods in the study for the crime 
data. 

Table 6: The frequency of the number of selected components (K) and the mean (sd) of the 
number of selected features over 100 simulations for the crime data. 


Methods 

K 

number of selected features 

2 

3 

4 

5 

SiER 

76 

24 

0 

0 

52.2(23.3) 

SRRR 

22 

12 

17 

49 

32.9(18.2) 

RemMap 

- 

- 

- 

- 

37.6(8.2) 

SPLS 

1 

1 

27 

71 

81.5(23.8) 

SepLasso 

- 

- 

- 

- 

32.7(7.4) 


number of kids born to never married are positively associated with crimes. Coefficients for 
variables of population living in urban areas and having foreign born are positive for arson, 
but negative for auto theft. 


7 Discussion 

In this paper, we propose a signal extracting approach for dimension reduction and regres¬ 
sion in multiple response linear model with high-dimensional predictor variables. Under the 
reduced rank assumptions on the coefficient matrix, we aim to estimate the optimal lower 
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Figure 4: Histograms of nonzero coefficients of variables that are selected over 90% times 
in predicting the number of assault. 

rank decomposition of the coefficient matrix in terms of approximating the regression func¬ 
tion. We establish a general eigenvalue problem and its sparse version for high-dimensional 
settings. The solution of these problems provides the estimate of the first lower rank matrix. 
Applying the least squares regression on the response variables and new predictors generated 
from the estimate of the first lower rank matrix, we obtain the estimate of the second lower 
rank matrix. In the high-dimensional setting, we establish the oracle inequalities for the 
estimation of the lower rank matrices, the coefficient matrix and the estimated regression 
function, allowing correlation among random errors for different response variables. We do 
not make restrictions on the dimension of the multivariate response. In the special case of 
the usual linear regression model with a scalar response, onr oracle inequalities provide upper 
bounds that have the same order as those for the Lasso and Dantzig selector. The simulation 
studies and the application to real data show that the proposed method has good prediction 
performance and is efficient in dimension reduction for various reduced rank models. 
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